Add cuco::bloom_filter #573
CC @kkraus14 |
Thanks for throwing this up @sleeepyjack! I've pinged some folks on my side to take a pass at reviewing the host and device APIs as well as general functionality here in order to provide some high level feedback as a starting point. |
Thanks for working on this. I know that we are really excited at the prospect of being able to use this to accelerate some of our workloads. I want to describe some of the ways we were hoping to use this.
We would like to be able to serialize and deserialize the underlying data of the bloom filter so that we can send it across the network, preferably in a way that doesn't require a copy of the structure, since we use libraries like UCX which can send directly from the GPU. Not being able to do this would probably make this unusable for us because of our distributed use case. Another thing to consider, in particular for testing whether sets are disjoint, is partitioned bloom filters as described in these papers: Understanding Bloom Filter Intersection. You might also consider a contains API that doesn't write output to an iterator but rather to a boolean indicating whether or not the bloom filter contains an element from the input being offered, and an API that can take two bloom filters and test if the sets that produced them are disjoint, so something like
Say you want to join two tables. One is 1TB, the other is 10GB. We would like to be able to make a bloom filter from subsets of the 10GB table and then use those bloom filters to test if subsets of the 1TB table are disjoint with those subsets of the 10GB table. There are two features we could benefit from to help enable this work. One is that we build bloom filters for all subsets of the tables (these might be distributed across many nodes) and then shuffle these around to test and see which subset combinations can be ruled out for joining. So in this case it's the same API I mentioned above, where you can see if the inputs of two bloom filters were disjoint. The other is that when they are possibly not disjoint we could then apply the bloom filter row by row with APIs that seem to be already available in this PR.
Say someone is joining two tables on two columns, e.g. a.x = b.x and a.y = b.y. The bloom filter we would want to make in this case would be one where the input was a combination of x and y, rather than having to make two bloom filters, one for each column. This would be better at lowering the false positive rate than if we were to test them separately. |
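To make the requested disjointness test concrete, here is a minimal host-side sketch. It is not part of this PR: `definitely_disjoint` is a hypothetical name, and the filters are modeled as plain arrays of 64-bit words that were built with the same size and the same hash functions. The idea is that any key present in both sets sets the same bits in both filters, so if the bitwise AND of the two bit arrays is all zero, the sets cannot share a key.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not a cuco API. Returns true only if the two key sets
// are guaranteed to be disjoint. Both filters must have the same size and must
// have been built with the same hash functions.
inline bool definitely_disjoint(std::vector<std::uint64_t> const& a,
                                std::vector<std::uint64_t> const& b)
{
  assert(a.size() == b.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    if ((a[i] & b[i]) != 0) { return false; }  // shared bit: the sets may overlap
  }
  return true;  // no shared bit anywhere: no key can be in both sets
}
```

A `false` result only means "possibly not disjoint"; in that case the row-by-row contains check would still be needed.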
Thanks for the valuable insights, @felipeblazing ! I can address some of the points right away:
This should be doable. We did a similar thing for our
Thanks for sharing those papers. I don't have access permissions so I requested them. From my rough understanding, a partitioned Bloom filter stores the signature for each key in
Yep, this one should be easy to implement. Naming-wise I would go with something like
I think this would already work by customizing the hash function similar to what cudf does with their |
cuDF has 32-bit and 64-bit hash_combine implementations. Usually, the crucial thing here is that you need a non-commutative function so that |
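As an illustration of a non-commutative combiner, the sketch below uses the well-known Boost-style 32-bit formula. This is an assumption for illustration only; it is not claimed to match cuDF's implementation line for line. Because the left operand is shifted and the right one is not, swapping the two column hashes generally produces a different result.

```cpp
#include <cstdint>

// Boost-style 32-bit hash combiner (illustrative; the name mirrors the comment
// above but this is not taken from cuDF's source).
constexpr std::uint32_t hash_combine(std::uint32_t lhs, std::uint32_t rhs)
{
  // lhs is mixed via shifts while rhs is only added, so the function is
  // not symmetric in its arguments.
  return lhs ^ (rhs + 0x9e3779b9u + (lhs << 6) + (lhs >> 2));
}

// Swapping the operands yields a different value for this input pair.
static_assert(hash_combine(1u, 2u) != hash_combine(2u, 1u),
              "combiner must distinguish (x, y) from (y, x)");
```

For a two-column join key, the per-row value fed to the filter would then be something like `hash_combine(hash(x), hash(y))`.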
Hi @felipeblazing! This looks to me, based on the code as well as the PR message, like it is based on Apache Impala's Bloom filters, which in turn infiltrated Apache Arrow, Apache Parquet, and so on - at least as long as On the other hand, when For SIMD on CPU, sectorized/partitioned with a |
Very interesting point! Yes, you could apply the same approach to the blocked (or sectorized) Bloom filter in this PR.
GPUs follow a similar principle where the GPU's cache line size is either 32byte (a sector) or 4*32byte aka an L2 slice depending on how you look at it. In 9332c9a I was able to fix some performance issues for when the |
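For readers unfamiliar with the blocked layout being discussed, here is a minimal host-side sketch, assuming a block of eight 32-bit words (32 bytes, i.e. one sector) and an already-computed 64-bit hash. The names `blocked_filter` and `make_pattern` and the mixing constant are hypothetical and do not reflect the PR's actual device code. Each key touches exactly one block, so a lookup needs only one sector of memory.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t words_per_block = 8;  // 8 x 4 bytes = one 32-byte sector
using block_t = std::array<std::uint32_t, words_per_block>;

// Hypothetical bit pattern: one bit per word, derived from a cheap remix of the hash.
inline block_t make_pattern(std::uint64_t hash)
{
  block_t pattern{};
  auto h = static_cast<std::uint32_t>(hash);
  for (std::size_t i = 0; i < words_per_block; ++i) {
    h = h * 0x9e3779b1u + static_cast<std::uint32_t>(i);  // illustrative mixing only
    pattern[i] = std::uint32_t{1} << (h >> 27);           // top 5 bits pick bit 0..31
  }
  return pattern;
}

struct blocked_filter {
  std::vector<block_t> blocks;  // zero-initialized on construction

  explicit blocked_filter(std::size_t num_blocks) : blocks(num_blocks)
  {
    assert(num_blocks > 0);
  }

  // The upper hash bits select the block; the lower bits drive the pattern.
  void add(std::uint64_t hash)
  {
    auto& block        = blocks[(hash >> 32) % blocks.size()];
    auto const pattern = make_pattern(hash);
    for (std::size_t i = 0; i < words_per_block; ++i) { block[i] |= pattern[i]; }
  }

  bool contains(std::uint64_t hash) const
  {
    auto const& block  = blocks[(hash >> 32) % blocks.size()];
    auto const pattern = make_pattern(hash);
    for (std::size_t i = 0; i < words_per_block; ++i) {
      if ((block[i] & pattern[i]) != pattern[i]) { return false; }
    }
    return true;  // all pattern bits set: key was probably inserted
  }
};
```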
Would it make sense, then, for With some other small changes, this could also support using the Bloom filters from Parquet, KVRocks, and Impala (and forks Doris and StarRocks) without having to read the input keys and re-encode a new Bloom filter. I can imagine this might be of use mostly in the Parquet case. |
Address #573 (comment) This PR updates the existing code to use `cuda::std::byte` as the default device data type instead of `std::byte`. This change addresses potential issues where `cuda::std::` utilities cannot be applied to `std::byte` when relaxed constexprs are disabled.
include/cuco/bloom_filter.cuh
class Extent = cuco::extent<std::size_t>,
cuda::thread_scope Scope = cuda::thread_scope_device,
class Policy =
cuco::bloom_filter_policy<cuco::xxhash_64<Key>, cuda::std::array<std::uint32_t, 8>>,
probably worth adding a strong type instead of using plain cuda::std::array
Hmm, I also thought about this. The information needed here is the word type and block size (determined by `typename Block::value_type` and `cuda::std::tuple_size_v<Block>`, i.e., standardized facilities to describe a container's value type and static size). So technically any container that follows this concept can be used to describe the filter block parameters.
This type has a single use in cuco and since the `Block` type itself is actually never used, i.e., we need it only to extract the block parameters, I would rather vote for replacing the `Block` tparam in the policy with `bloom_filter_policy<class Hash, class Word, uint32_t WordsPerBlock>`. WDYT?
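The two policy shapes discussed here can be compared side by side in a small host-side sketch. It uses `std::array` and `std::tuple_size_v` as stand-ins for their `cuda::std` counterparts, and the names `policy_from_block`, `policy_explicit`, and `dummy_hash` are hypothetical; the real `cuco::bloom_filter_policy` may differ.

```cpp
#include <array>
#include <cstdint>
#include <tuple>
#include <type_traits>

// Variant A: block parameters are deduced from a container-like Block type.
template <class Hash, class Block>
struct policy_from_block {
  using word_type = typename Block::value_type;
  static constexpr std::uint32_t words_per_block =
    static_cast<std::uint32_t>(std::tuple_size_v<Block>);
};

// Variant B: block parameters are spelled out directly as template parameters.
template <class Hash, class Word, std::uint32_t WordsPerBlock>
struct policy_explicit {
  using word_type = Word;
  static constexpr std::uint32_t words_per_block = WordsPerBlock;
};

struct dummy_hash {};  // placeholder hasher, for illustration only

using policy_a = policy_from_block<dummy_hash, std::array<std::uint32_t, 8>>;
using policy_b = policy_explicit<dummy_hash, std::uint32_t, 8>;

// Both variants carry exactly the same information.
static_assert(std::is_same_v<policy_a::word_type, policy_b::word_type>);
static_assert(policy_a::words_per_block == policy_b::words_per_block);
```

Variant B avoids instantiating a container type that is never used as a container, which is the argument made in the comment above.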
Co-authored-by: Yunsong Wang <[email protected]>
Great work. @sleeepyjack Thank you!
Forgot to post some final benchmark results:
(includes changes from #609) |
Supersedes #101
Implementation of a GPU "Blocked Bloom Filter".
This PR is an updated/optimized version of #101 and features the following improvements: